SPECIFICATION



1 TITLE OF THE INVENTION 



Speech Recognition and Signal Analysis by Exact Fast Search of Subsequences with Maximal 
Confidence Measure 



2 REFERENCE TO APPENDIX SUBMITTED ON CD 

Not Applicable 

3 CROSS-REFERENCE TO RELATED APPLICATION 

This patent application has as parent application the patent application C99-00214/25.02.1999
registered with the State Office for Inventions and Trademarks (OSIM) in Bucharest, Romania.
The present application is the US national stage of the international application
PCT/IB00/00189 registered with the International Patent Office in Geneva.

4 BACKGROUND OF THE INVENTION 



4.1 FIELD OF THE INVENTION 



The invention relates to a common component of: 



• Speech Recognition, more particularly the fields of Keyword Spotting and decoding,

• Segments Alignment for DNA and proteins,

• Recognition of Objects in Images.

4.2 DESCRIPTION OF THE RELATED ART 

This invention addresses the problem of keyword spotting (KWS) in unconstrained speech
without explicit modeling of non-keyword segments (typically done by using filler HMM
models or an ergodic HMM composed of context-dependent or context-independent phone
models without lexical constraints). Several methods (sometimes referred to as "sliding model
methods") tackling this type of problem have already been proposed in the past. For example,
they use Dynamic Time Warping (DTW) or Viterbi matching, allowing relaxation of
the begin and endpoint constraints. These are known to require the use of an
"appropriate" normalization of the matching scores, since segments of different lengths
then have to be compared. However, given this normalization and the relaxation of
begin/endpoints, straightforward Dynamic Programming (DP) is no longer optimal (in
other words, the DP optimality principle is no longer valid) and has to be adapted,
involving more memory and CPU. Indeed, at any possible ending time $e$, the match score
of the best warp and start time $b$ of the reference has to be computed (for all
possible start times $b$ associated with unpruned paths). Finally, this adapted DP quickly
becomes even more complex (or intractable) for more advanced scoring criteria (such as the
confidence measures mentioned below).

Work in the field of confidence level, and in the framework of hybrid HMM/ANN systems,
has shown that the use of accumulated local posterior probabilities (as obtained at the
output of a multilayer perceptron) normalized by the length of the word segment (or,
better, involving a double normalization over the number of phones and the number of
acoustic frames in each phone) yields good confidence measures and good scores for the
re-estimation of N-best hypotheses. However, so far the evaluation of such confidence
measures involved the estimation and rescoring of N-best hypotheses.

KWS methods without filler models have in common the selection of a subsequence of
the utterance to match the keyword models of interest. Let $X = \{x_1, x_2, \ldots, x_n, \ldots, x_N\}$
denote the sequence of acoustic vectors in which we want to detect a keyword, and let $M$
be the HMM model of a keyword, consisting of $L$ states $\mathcal{Q} = \{q_1, q_2, \ldots, q_\ell, \ldots, q_L\}$.
Assuming that $M$ is matched to a subsequence $X_b^e = \{x_b, \ldots, x_e\}$ ($1 \le b \le e \le N$) of $X$,
and that we have an implicit (not modeled) garbage/filler state $q_G$ preceding and following
$M$, one can define (approximate) the log posterior of a model $M$ given a subsequence
$X_b^e$ as the average posterior probability along the optimal path, i.e.:

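$$
\begin{aligned}
-\log P(M \mid X_b^e) &\simeq \frac{1}{e-b+1} \min_{Q \in M} -\log P(Q \mid X_b^e) \\
&= \frac{1}{e-b+1} \min_{Q \in M} \Big\{ -\log P(q^b \mid q_G) \\
&\qquad - \sum_{n=b}^{e-1} \big[ \log P(q^n \mid x_n) + \log P(q^{n+1} \mid q^n) \big] \\
&\qquad - \log P(q^e \mid x_e) - \log P(q_G \mid q^e) \Big\}
\end{aligned}
\tag{1}
$$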

where $Q = \{q^b, q^{b+1}, \ldots, q^e\}$ represents one of the possible paths of length $(e - b + 1)$
in $M$, and $q^n$ the HMM state visited at time $n$ along $Q$, with $q^n \in \mathcal{Q}$. In this
expression, $q_G$ represents the "garbage" (filler) state, which is simply used here as the
non-emitting initial and final state of $M$. The transition probabilities $P(q^b \mid q_G)$ and
$P(q_G \mid q^e)$ can be interpreted as the keyword entrance and exit penalties, but can also
simply be set to a constant. Local posteriors $P(q_\ell \mid x_n)$ can be estimated using any of
the known techniques: multi-Gaussians, code-books, or as output values of a multilayer
perceptron (MLP) used in hybrid HMM/ANN systems. For a specific sub-sequence $X_b^e$,
expression (1) can easily be estimated by dynamic programming, since the sub-sequence and
the associated normalizing factor $(e - b + 1)$ are given. However, in the case of keyword
spotting, this expression should be estimated for all possible begin/endpoint pairs
$\{b, e\}$ (as well as for all possible word models), and we define the matching score of
$X$ on $M$ as:

$$S(M \mid X) = -\log P(M \mid X_{b^*}^{e^*}) \tag{2}$$

where the optimal begin/endpoints $\{b^*, e^*\}$, and the associated optimal path $Q^*$, are the
ones yielding the lowest average local posterior:

$$\langle Q^*, b^*, e^* \rangle = \operatorname*{argmin}_{\{Q, b, e\}} \frac{-1}{e-b+1} \log P(Q \mid X_b^e) \tag{3}$$

Of course, in the case of several keywords, all possible models will have to be evaluated. 
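
As a reference point, criterion (3) can be evaluated by brute force, trying every begin frame and extending a standard DP over every end frame. The sketch below is only an illustration under stated assumptions: a strictly left-to-right keyword HMM with self-loops, all transition penalties set to zero, and a hypothetical input `cost[n][s]` holding $-\log P(q_s \mid x_n)$. The function name and data layout are not taken from the original text.

```python
import math

def best_average_match(cost):
    """Exhaustive evaluation of criterion (3).

    cost[n][s] is assumed to hold -log P(q_s | x_n) for frame n and state s.
    Returns (average_cost, b, e) with 0-based, inclusive boundaries.
    """
    n_frames, n_states = len(cost), len(cost[0])
    best = (math.inf, None, None)
    for b in range(n_frames):
        # d[s]: cheapest path that enters state 0 at frame b and sits in state s now
        d = [math.inf] * n_states
        d[0] = cost[b][0]
        for e in range(b, n_frames):
            if e > b:
                d = [cost[e][s] + min(d[s], d[s - 1] if s > 0 else math.inf)
                     for s in range(n_states)]
            if d[-1] < math.inf:
                avg = d[-1] / (e - b + 1)  # normalization by segment length
                if avg < best[0]:
                    best = (avg, b, e)
    return best

if __name__ == "__main__":
    toy = [[2.0, 2.0], [0.1, 3.0], [3.0, 0.2], [2.0, 2.0]]
    print(best_average_match(toy))
```

This costs on the order of $N^2 L$ operations, which is the overhead the remainder of the description seeks to avoid.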

A double averaging involving the number of frames per phone and the number of phones
usually yields slightly better performance when used to rescore N-best candidates:

$$\langle Q^*, b^*, e^* \rangle = \operatorname*{argmin}_{\{Q, b, e\}} \frac{-1}{J} \sum_{j=1}^{J} \left( \frac{1}{e_j - b_j + 1} \sum_{n=b_j}^{e_j} \log P(q_j^n \mid x_n) \right) \tag{4}$$

where $J$ represents the number of phones in the hypothesized keyword model and $q_j^n$ the
hypothesized phone $q_j$ for input frame $x_n$. However, given the time normalization and
the relaxation of begin/endpoints, straightforward DP is no longer optimal and has to be
adapted, usually involving more memory and CPU.

Filler-based KWS needs a simpler decoding step. Although various solutions have been
proposed towards the direct optimization of (2), most of the keyword spotting approaches
today prefer to preserve the optimality and simplicity of Viterbi DP by modeling the
complete input and explicitly or implicitly modeling non-keyword segments by using
so-called filler or garbage models as additional reference models. In this case, we assume
that non-keyword segments are modeled by extraneous garbage models/states $q_G$ (and
grammatical constraints ruling the possible keyword/non-keyword sequences).

It is sufficient to consider only the case of detecting one keyword per utterance at a
time. In this case, the keyword spotting problem amounts to matching the whole sequence $X$
of length $N$ onto an extended HMM model $\bar{M}$ consisting of the states
$\{q_G, q_1, \ldots, q_L, q_G\}$, in which a path (of length $N$) is denoted

$$\bar{Q} = \{\underbrace{q_G, \ldots, q_G}_{b-1},\ q^b, q^{b+1}, \ldots, q^e,\ \underbrace{q_G, \ldots, q_G}_{N-e}\}$$

with $(b - 1)$ garbage states $q_G$ preceding $q^b$ and $(N - e)$ states $q_G$ following $q^e$,
respectively emitting the vector sequences $X_1^{b-1}$ and $X_{e+1}^N$ associated with the
non-keyword segments.

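Given some estimation of $P(q_G \mid x_n)$ (e.g., using probability density functions trained on
non-keyword utterances), the optimal path $\bar{Q}^*$ (and, consequently, $b^*$ and $e^*$) is then
given by:

$$
\begin{aligned}
\bar{Q}^* &= \operatorname*{argmin}_{\bar{Q} \in \bar{M}} -\log P(\bar{Q} \mid X) \\
&= \operatorname*{argmin}_{\bar{Q} \in \bar{M}} \Big\{ -\log P(Q \mid X_b^e) - \sum_{n=1}^{b-1} \log P(q_G \mid x_n) - \sum_{n=e+1}^{N} \log P(q_G \mid x_n) \Big\}
\end{aligned}
\tag{5}
$$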

which can be solved by straightforward DP (since all paths have the same length). The main
problem of filler-based keyword spotting approaches is then to find ways to best estimate
$P(q_G \mid x_n)$ in order to minimize the error introduced by the approximations.
Sometimes this value was defined as the average of the N best local scores while, in other
approaches, this value is generated from explicit filler HMMs. However, these approaches
will usually not lead to the "optimal" solution given by (2).
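
One of the empirical garbage estimates mentioned above, the average of the best local scores at each frame, can be sketched as follows. This is only an illustration: the function name, the choice to average the `k` largest posteriors before taking the logarithm, and the input layout (`posteriors[n][s]` holding $P(q_s \mid x_n)$) are assumptions, not details given in the text.

```python
import math

def online_garbage_costs(posteriors, k):
    """Per-frame garbage cost -log P(q_G | x_n), estimated as the negative
    log of the mean of the k largest local posteriors of that frame."""
    costs = []
    for frame in posteriors:
        best = sorted(frame, reverse=True)[:k]  # k best local scores
        costs.append(-math.log(sum(best) / len(best)))
    return costs

if __name__ == "__main__":
    print(online_garbage_costs([[0.5, 0.3, 0.2], [0.8, 0.1, 0.1]], k=2))
```

Such an estimate is cheap, but nothing ties it to criterion (3), which is why it generally does not yield the optimum of (2).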

5 BRIEF SUMMARY OF THE INVENTION 

The invention belongs to the technical domain of decoding, classification, alignment and
matching of data.
The invention introduces a new method for keyword spotting in utterances, detection of
subsequences in chains of organic matter (DNA and proteins) and recognition of objects in
images. The proposed methods search in an optimized way for the matching that maximizes,
over all the possible matchings, certain confidence measures based on normalized
posteriors. Three such confidence measures are used: two existed in previous work
in Speech Recognition, and the third one is new.

Application fields for this invention are: man-machine interfaces (using speech
recognition; e.g., control systems, banking, flight services, etc.), coordination systems
(for industrial robots and automata) and development systems for pharmaceutical products.

6 BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE 
DRAWINGS 

Not Applicable 



7 DETAILED DESCRIPTION OF THE INVENTION 

The present invention introduces a fast iterative method, referred to as
Iterating Viterbi Decoding (IVD), with good/fast convergence properties, estimating the
value of $P(q_G \mid x_n)$ such that straightforward DP (5) yields exactly the same segmentation
(and recognition results) as (3). While the same result could be achieved through a
modified DP in which all possible combinations (all possible begin/endpoints) would be
taken into account, the method proposed below is much more efficient (in terms of both
CPU and memory requirements).

Compared to previously devised "sliding model" methods the first method proposed here 
is based on: 

1. A matching score defined as the average observation probability (posterior) along
the most likely state sequence. It is indeed believed that local posteriors are more
appropriate to the task.

2. The iteration of a Viterbi decoding algorithm, which does not require scoring for all
begin/endpoints or N-best rescoring, and which can be proved to (quickly) converge to
the "optimal" (from the point of view of the chosen scoring functions) solution without
requiring any specific filler models, using straightforward Viterbi alignments (similar
to regular filler-based KWS, but, for some versions, at the cost of a few iterations).

The IVD method is based on a similar criterion as the filler-based approaches (5), but
rather than looking for explicit (and empirical) estimates of $P(q_G \mid x_n)$ we
aim at mathematically estimating its value (which will be different and adapted to each
utterance) such that solving (5) is equivalent to solving (3). Thus, we perform an
iterative estimation of $P(q_G \mid x_n)$, such that the segmentation resulting from (5) is the
same as what would be obtained from (3). Defining $\varepsilon_t = -\log P(q_G \mid x_n)$ at
iteration $t$, the proposed method can be summarized as follows:

1. Start the first iteration, $t = 0$, from an initial value $\varepsilon_0 = \Pi$ (it is actually
proven that the iterative process presented here will always converge to the same solution, in
more or fewer cycles, with the worst-case upper bound of $N$ iterations, independently of
this initialization, e.g., with $\Pi$ equal to a cheap estimation of the score of a "match").

In one of the developed versions, $\varepsilon_0$ is initialized to $-\log$ of the maximum of the
local probabilities $P(q_k \mid x_n)$ for each frame $x_n$.

An alternative choice is to initialize $\varepsilon_0$ to a pre-defined threshold score, $T$,
that expression (1) should reach to declare a keyword "matching" (see step 4
below). In this last case, if $\varepsilon_1 > \varepsilon_0$ at the first iteration, then we can
(as proven) directly infer that the match will be rejected; otherwise it will be accepted.

2. Given the estimate $\varepsilon_t$ of $-\log P(q_G \mid x_n)$ at the current iteration $t$, find the
optimal path $\langle \bar{Q}_t, b_t, e_t \rangle$ according to (5) and matching the complete input.

3. Estimate the value of $\varepsilon_{t+1}$, to be used in the next iteration, as the average of the
local posteriors along the optimal path $Q_t$ (matching the $X_{b_t}^{e_t}$ resulting from (5) on
the keyword model), i.e.:

$$\varepsilon_{t+1} = -\frac{1}{e_t - b_t + 1} \log P(Q_t \mid X_{b_t}^{e_t}) \tag{6}$$

4. Increment $t$ and return to step 2, iterating until convergence is detected. If we are not
interested in the optimal segmentation, this process could also be stopped as soon as it
reaches an $\varepsilon_{t+1}$ lower than a (pre-defined) minimum threshold, $T$, below which we can
declare that a keyword has been detected.

Proofs of correctness and convergence of this process, and its generalization to other
criteria, are available: each IVD iteration (from the second iteration on) will decrease the
value of $\varepsilon_t$, and the final path yields the same solution as (3). The above method
has a very good experimental convergence speed (3-5 iterations in our tests). For one
version of IVD (when $\varepsilon_0$ is initialized using the acceptance threshold, $T$), the
detection is decided after one single step.
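
Steps 1 to 4 above can be sketched as follows. This is a minimal illustration under simplifying assumptions that are not part of the original text: a left-to-right keyword HMM with self-loops, zero transition penalties, a single garbage cost `eps` shared by all frames, and a hypothetical input `cost[n][s]` holding $-\log P(q_s \mid x_n)$. The helper names `filler_viterbi` and `ivd` are invented for the example.

```python
import math

def filler_viterbi(cost, eps):
    """One Viterbi pass over the whole input as in (5), garbage cost eps per frame.

    Requires at least as many frames as states. Returns (total_cost, b, e)
    with 0-based, inclusive keyword boundaries.
    """
    n_frames, n_states = len(cost), len(cost[0])
    pre = 0.0                              # frames 0..n-1 spent in leading garbage
    kw = [(math.inf, None)] * n_states     # (cost, begin) per keyword state
    post = (math.inf, None, None)          # (cost, begin, end) in trailing garbage
    for n in range(n_frames):
        new_kw = []
        for s in range(n_states):
            enter = (pre, n) if s == 0 else kw[s - 1]
            c, b = min(kw[s], enter, key=lambda t: t[0])
            new_kw.append((c + cost[n][s], b))
        leave = (kw[-1][0], kw[-1][1], n - 1)   # keyword ended at frame n-1
        c, b, e = min(post, leave, key=lambda t: t[0])
        post = (c + eps, b, e)
        kw = new_kw
        pre += eps
    final = (kw[-1][0], kw[-1][1], n_frames - 1)
    return min(post, final, key=lambda t: t[0])

def ivd(cost, eps0, tol=1e-9):
    """Iterate steps 2-4: decode with eps, then re-estimate eps as the
    average keyword cost of the decoded segment."""
    n_frames = len(cost)
    eps = eps0
    for _ in range(n_frames + 1):
        total, b, e = filler_viterbi(cost, eps)
        length = e - b + 1
        # remove the garbage contribution, then normalize by segment length
        new_eps = (total - (n_frames - length) * eps) / length
        if abs(new_eps - eps) <= tol:
            break
        eps = new_eps
    return new_eps, b, e

if __name__ == "__main__":
    toy = [[2.0, 2.0], [0.1, 3.0], [3.0, 0.2], [2.0, 2.0]]
    print(ivd(toy, eps0=5.0))
```

On the toy input the iteration settles on the same segment whether it starts from `eps0=5.0` or `eps0=1.0`, which matches the claimed independence from the initialization.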

A version with the same effort but suboptimal results is proposed in the following
paragraph. Let $T(M, X)$ be a matrix holding the HMM emission probabilities for an utterance
$X$ whose time-frames define the columns, and where the states of the hypothesized word $W$
define the rows. When using the standard DP, one computes for each element of the matrix
$T(M, X)$, at frame $k$ of $X$ and state $s$ of $M$, three values: $S_{ks}$, $L_{ks}$ and $C_{ks}$,
where $S_{ks}$ corresponds to the sum of the entries on the optimal path that leads to the
entry, $L_{ks}$ holds the length of the optimal path computed so far, and $C_{ks}$ is the
estimation of the cost on the optimal expanded path. By a path leading to an entry $T(k, s)$
we mean a sequence of entries in the table $T$, such that there is exactly one entry for each
time frame up to $k$. At each entry $T(k, s)$, DP selects a locally optimal path noted
$P_{ks}$. At each step $k$, we consider all pairs of entries of table $T(M, X)$ of type
$T(k, s)$, $T(k - 1, t)$. We update for each such pair the current cost $C_{ks}$ (initially
$\infty$), by comparing it with the alternative given by:

$$
\begin{aligned}
S_{ks} &= S_{(k-1)t} - \log\big(p(s \mid x_k)\, p(s \mid t)\big) \\
L_{ks} &= L_{(k-1)t} + 1, \quad \forall t,\ 0 < t \le L \\
C_{ks} &= \frac{S_{ks}}{L_{ks}}
\end{aligned}
\tag{7}
$$

wanting to have at step $k$ the path $P_{ks}$, from the paths $P_{(k-1)t}$, that minimizes
$C_{ks}$. With DP, one will choose the $P_{ks}$ with minimal $C_{ks}$.

This version can yield suboptimal results since the optimality principle is not respected
by expression (7). The optimality principle of Dynamic Programming requires that the
path to the frame $k - 1$ that minimizes $C_{(k-1)t}$ also minimizes $C_{ks}$ for an entry at
frame $k$ of table $T(M, X)$.

Another technique that is suboptimal in time and/or quality is obtained from the previous
one by adopting a beam-search approach and a set of safe prunings. Dynamic Programming
can be viewed as a set of safe prunings that are applied at each entry of the DP table,
with the property that only one alternative is maintained. Dynamic Programming cannot be
used here, since the principle of optimality is not respected. The present invention
introduces the following type of safe pruning: we have proved that if at a frame $a$ we have
two paths $P'_a$ and $P''_a$ with $S''_a < S'_a$ and $L'_a < L''_a$, then at no frame $c > a$ will
a path $P''_c$ be forsaken for a path $P'_c$ if $P''_c \setminus P''_a = P'_c \setminus P'_a$, i.e.,
if both are extended by the same entries after frame $a$. We will note this order relation
as $P''_a \prec P'_a$. We have further shown that a path $P'$ may be safely discarded only when
we know a lower cost one, $P''$:

$$P''_k \prec P'_k \;\Rightarrow\; C''_k < C'_k \tag{8}$$
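
The safe-pruning property above can be checked with a short derivation. The following is a sketch rather than the proof referred to in the text; it assumes that all local costs are non-negative (they are negative logarithms of probabilities) and that both paths receive the same extension, of accumulated cost $s \ge 0$ and length $l \ge 0$, after frame $a$.

$$
\begin{aligned}
\frac{S''_a + s}{L''_a + l} \le \frac{S'_a + s}{L'_a + l}
&\iff (S''_a + s)(L'_a + l) \le (S'_a + s)(L''_a + l) \\
&\iff S''_a L'_a + S''_a\, l + s\, L'_a \;\le\; S'_a L''_a + S'_a\, l + s\, L''_a
\end{aligned}
$$

With $S''_a < S'_a$ and $L'_a < L''_a$, each term on the left is bounded by the corresponding term on the right: $S''_a L'_a \le S'_a L''_a$, $S''_a\, l \le S'_a\, l$ and $s\, L'_a \le s\, L''_a$. Hence the path with the smaller sum and the larger length keeps the lower average cost under every common extension, including the empty one.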

Thus, the following method computes $S(M, X)$ and $Q^*$ from equation (3). By ordering the
set of paths according to Equation (8), we only need to check step (1.1) of the following
method up to the eventual insertion place. The last paths are candidates for pruning in
step (1.2). In order for the pruning to be acceptable, we will prune only paths that were
too long on the last state. An additional counter for each path is needed for storing the
state length. This counter is reset when an entry from another row is added and is
incremented at each advance by one frame.

The following steps detail this method for a model W and an utterance X: 



a) Initialize all elements of a matrix, SetOfPaths[1..N, 1..K], to the empty set.

b) For all frames from 1 to N, for all states from 1 to K, for all candidates $p_i$ in
SetOfPaths[frame-1, 1..K]:

- For all $p_j$ in SetOfPaths[frame, state], if $p_i \prec p_j$ then delete $p_j$ (1.1), and
if $p_j \prec p_i$ then continue step b) (1.2)

- Insert $p_i$ in SetOfPaths[frame, state]

c) Select SetOfPaths[frame, K] as the best of the candidates
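
Steps a) to c) can be illustrated with the following sketch, which keeps, for every (frame, state) entry, only the paths not dominated by another path in the sense of the order relation above. It is a simplified reading, not the claimed implementation: it assumes a left-to-right HMM with self-loops, zero transition penalties and non-negative costs `cost[n][s]` $= -\log P(q_s \mid x_n)$, and it omits the beam limit and the per-state length counter. All names are hypothetical.

```python
import math

def insert_path(paths, cand):
    """Insert cand = (S, L, begin) unless an existing path has a sum that is
    no larger and a length that is no smaller; drop paths that cand dominates."""
    s_new, l_new, _ = cand
    for s_old, l_old, _ in paths:
        if s_old <= s_new and l_old >= l_new:
            return paths                       # step (1.2): candidate is dominated
    kept = [p for p in paths                   # step (1.1): remove dominated paths
            if not (s_new <= p[0] and l_new >= p[1])]
    kept.append(cand)
    return kept

def exact_search(cost):
    """Return (average_cost, b, e) minimizing criterion (3), 0-based inclusive."""
    n_frames, n_states = len(cost), len(cost[0])
    prev = [[] for _ in range(n_states)]       # step a): empty sets of paths
    best = (math.inf, None, None)
    for n in range(n_frames):                  # step b)
        cur = [[] for _ in range(n_states)]
        for s in range(n_states):
            sources = prev[s] + (prev[s - 1] if s > 0 else [])
            cands = [(S + cost[n][s], L + 1, b) for S, L, b in sources]
            if s == 0:
                cands.append((cost[n][0], 1, n))   # a keyword may start at frame n
            for cand in cands:
                cur[s] = insert_path(cur[s], cand)
        for S, L, b in cur[-1]:                # step c): best path in the last state
            if S / L < best[0]:
                best = (S / L, b, n)
        prev = cur
    return best

if __name__ == "__main__":
    toy = [[2.0, 2.0], [0.1, 3.0], [3.0, 0.2], [2.0, 2.0]]
    print(exact_search(toy))
```

Unlike the single-alternative DP of expression (7), several (sum, length) pairs may survive in one entry, which is what preserves the optimum of (3).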



The next method builds on the previous technique and is a fast procedure for maximizing
a more complex confidence measure that yields better results in practice. The corresponding
confidence measure is defined as:

$$\frac{1}{NVP} \sum_{h_i \in VP} \frac{\sum_{pst \in h_i} -\log(pst)}{\mathrm{length}(h_i)} \tag{9}$$

where $NVP$ stands for the number of visited phonemes and $VP$ stands for the set of visited
phonemes. An average is computed over all posteriors $pst$ of the emission probabilities for
the time frames matched to the visited phoneme $h_i$. The function $\mathrm{length}(h_i)$ gives
the number of time frames matched against $h_i$. This method uses a breadth-first Beam Search
algorithm. It exploits a set of reduction rules and certain normalizations. For the state
$q_G$, in this method, the logarithm of the emission posterior is equal to zero. For each
frame $e$ and for each state $s$, the set of paths/probabilities of having the frame $e$ in the
state $s$ is computed as the first $\mathcal{N}$ maxima ($\mathcal{N}$ can be finite) of the
confidence measure for all paths in HMM $M$ of length $e$ and ending in the state $s$. The
paths that, according to the reduction rules, will lose the final race when compared with
another already known path will be deleted as well. Let us note $a_1$, $p_1$, $l_1$,
respectively $a_2$, $p_2$ and $l_2$, the confidence measure for the previously visited
phonemes, the posterior in the current phoneme and the length in the current phoneme for
the path $Q_1$, respectively the path $Q_2$. The rules that can be used for the reduction of
the search space by discarding a path $Q_1$ for a path $Q_2$ are in this case any of the
next ones:

1. $l_2 > l_1$, $A > 0$, $B < 0$ and $L_c^2 A + L_c B + C > 0$

2. $l_2 > l_1$, $A \ge 0$, $B \ge 0$ and $C \ge 0$

3. $l_2 > l_1$, $A < 0$, $C > 0$ and $L^2 A + L B + C > 0$

4. $l_2 > l_1$, $A = 0$, $B < 0$ and $L B + C > 0$

where $A = a_1 - a_2$, $B = (a_1 - a_2)(l_1 + l_2) + p_1 - p_2$,
$C = (a_1 - a_2)\, l_1 l_2 + p_1 l_2 - p_2 l_1$, $L = L_{max} - \max\{l_1, l_2\}$,
$L_c = -B/2A > 0$ and $L_{max}$ is the maximum acceptable length for a phoneme. By discarding
paths only if one of the above rules is satisfied, the optimum defined by the confidence
measure with double normalization can be guaranteed, if no phone may be avoided by the HMM
$M$. Any HMM may be decomposed into HMMs with this quality. The 4th rule is included in
the 3rd and its test is useless if the last one was already checked. The first test,
$l_2 > l_1$, tells us if $Q_2$ has chances to eliminate $Q_1$; otherwise we will check
if $Q_1$ eliminates $Q_2$. These tests were inferred from the conditions of maintaining the
final maximal confidence measure while reduction takes place. In order to use the method of
double normalization without decomposing HMMs that skip some phonemes, the previous
rules are modified taking into account the number of visited phonemes for any path, $F_1$
respectively $F_2$, and the number of phonemes that may follow the current state. A
simplified test can be:

• $l_2 > l_1$, $A > 0$, $p_1 > p_2$, respectively $F_2 > F_1$ for the HMMs that skip phonemes.

This test is weaker than the 2nd reduction rule. For example, a path is eliminated by a second
path if the first one has an inferior confidence measure (higher in value) for the previous
phonemes, a shorter length, and the minus of the logarithm of the cumulated posterior in
the current phoneme also inferior (higher in value) to that of the second one. An additional
confidence measure based on the maximal length, $L_{max}$, and on the maximum of the minus
of the logarithm of the cumulated and normalized posterior in a phoneme, $P_{max}$, can be used
in order to limit the number of stored paths. A path is discarded when:

• $p > L_{max} P_{max}$ in any state

• $p / l > P_{max}$ at the output from a phoneme

where $p$ and $l$ are the values in the current phoneme for the minus of the logarithm of the
cumulated posterior and for the length of the path that is discarded. These tests allow for
the elimination of the paths that are too long without being outstanding, respectively of
the paths with phonemes having unacceptable scores, otherwise compensated by very good
scores in other phonemes. If $\mathcal{N}$ is chosen equal to one, the aforementioned rules are
no longer needed, and we always propagate the path with the maximal current estimation of
the confidence measure. The obtained results are very good, even if the defined optimum is
guaranteed for this method only when $\mathcal{N}$ is bigger than the length of the sequence
allowed by $L_{max}$ or of the tested sequence. The same approach is valid for the simple
normalization, where the HMM for the searched word will be grouped into a single phoneme.
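
The double-normalization measure and the coefficients $A$, $B$, $C$ used by the reduction rules can be made concrete with a small sketch. It rests on one interpretation that is an assumption rather than a statement of the text: that the rules compare the quantities $a_i + p_i / (l_i + x)$ of the two paths after $x$ further frames in the current phoneme, so that their difference, multiplied by $(l_1 + x)(l_2 + x)$, equals $A x^2 + B x + C$. The function names and the input layout (one list of per-frame $-\log$ posteriors per visited phoneme) are hypothetical.

```python
def double_normalized_confidence(phoneme_costs):
    """Average, over visited phonemes, of the per-phoneme average cost.

    phoneme_costs: one list of -log posteriors per visited phoneme.
    """
    per_phoneme = [sum(costs) / len(costs) for costs in phoneme_costs]
    return sum(per_phoneme) / len(per_phoneme)

def reduction_coefficients(a1, p1, l1, a2, p2, l2):
    """Coefficients A, B, C as defined for the reduction rules."""
    A = a1 - a2
    B = (a1 - a2) * (l1 + l2) + p1 - p2
    C = (a1 - a2) * l1 * l2 + p1 * l2 - p2 * l1
    return A, B, C

def scaled_gap(a1, p1, l1, a2, p2, l2, x):
    """Difference between the two paths after x more frames in the current
    phoneme, multiplied by the positive factor (l1 + x) * (l2 + x)."""
    gap = (a1 + p1 / (l1 + x)) - (a2 + p2 / (l2 + x))
    return gap * (l1 + x) * (l2 + x)

if __name__ == "__main__":
    A, B, C = reduction_coefficients(1.0, 2.0, 2, 0.5, 3.0, 4)
    for x in range(4):
        print(x, A * x * x + B * x + C, scaled_gap(1.0, 2.0, 2, 0.5, 3.0, 4, x))
```

Under this reading, each rule is a sufficient condition for the quadratic to stay positive over the admissible range of $x$, meaning that $Q_1$ remains worse than $Q_2$ and can be dropped.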

The present invention can exploit a newly designed confidence measure, named
"Real Fitting", that represents differently the exigencies of the recognition. Since the
phonemes and the absent states can be modeled by the HMMs used, we find it interesting
to request the fitting of each phoneme in the model with a section of the sequence.
Therefore, we measure the confidence level of a subsequence as being equal to the maximum,
over all phonemes, of the minus of the logarithm of the cumulated posterior of the
phoneme, normalized by its length:

$$\max_{\text{phoneme} \,\in\, \text{Visited Phonemes}} \frac{\sum_{\text{phoneme}} -\log(\text{posteriors})}{\text{phoneme length}}$$

The rule that may be used in this framework for the reduction of the number of visited paths
is:

• $Q_2$ is discarded in favor of another path $Q_1$ if the confidence measure of the Real
Fitting for the previous phonemes is inferior (higher in value) for $Q_2$ compared with
$Q_1$, and if $p_1 < p_2$ and $l_2 < l_1$.

where $p_1$, $l_1$, respectively $p_2$, $l_2$, represent the minus of the logarithm of the
cumulated posterior, respectively the number of frames, in the current phoneme for the path
$Q_1$, respectively $Q_2$. Similarly to the previous method, the set of visited paths can be
pruned by discarding those where:

• $p > L_{max} P_{max}$ in any state

• $p / l > P_{max}$ at the output from a phoneme

where $p$ and $l$ are the values in the current phoneme for the minus of the logarithm of the
cumulated posterior and for the length of the path that is discarded. We recall that the
constants are the maximal length, $L_{max}$, and the accepted maximum of the minus of the
logarithm of the cumulated and normalized posterior in a phoneme, $P_{max}$.

This invention thus proposes a new method for keyword spotting, based on recent
advances in confidence measures, using local posterior probabilities, but without requiring
the explicit use of filler models. A new method, referred to as Iterating Viterbi Decoding
(IVD), solves the above optimization problem with a simple DP process (not requiring the
storage of pointers and scores for all possible ending and start times). Three other new
beam-search algorithms, corresponding to three different confidence measures, are also
proposed.
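
The Real Fitting measure and the two additional pruning tests described above can be sketched as follows. The function names, the argument order and the input layout (one list of per-frame $-\log$ posteriors per visited phoneme) are assumptions made for the illustration.

```python
def real_fitting_confidence(phoneme_costs):
    """Worst (largest) per-phoneme average of the -log posteriors.

    phoneme_costs: one list of -log posteriors per visited phoneme.
    """
    return max(sum(costs) / len(costs) for costs in phoneme_costs)

def should_prune(p, l, leaving_phoneme, l_max, p_max):
    """Additional pruning tests based on L_max and P_max.

    p: cumulated -log posterior in the current phoneme.
    l: number of frames spent in the current phoneme.
    leaving_phoneme: True when the path is at the output of the phoneme.
    """
    if p > l_max * p_max:            # test valid in any state
        return True
    return leaving_phoneme and p / l > p_max

if __name__ == "__main__":
    print(real_fitting_confidence([[1.0, 3.0], [4.0], [0.5, 0.5]]))
    print(should_prune(9.0, 2, True, l_max=10, p_max=4.0))
```

Because the measure is a maximum rather than an average, a single badly matched phoneme cannot be compensated by good scores elsewhere, which is the stated purpose of requesting a fit for each phoneme.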
To summarize, the object of the invention consists of:

• Method of recognition of a subsequence using a direct maximization of confidence 
measures. 

• The method of IVD for directly maximizing the confidence measures based on simple 
normalization. 

• The use of the confidence measure and method of recognition named 'Real Fitting', 
based on individual fitting for each phoneme. 

• Methods of recognition using simple and double normalization, combining these measures
with the additional confidence measures mentioned here, respectively the maximal length
and real matching limitation.

• The use of the aforementioned methods in keyword recognition. 

• The use of the aforementioned methods in subsequence recognition of organic matter. 

• The use of the aforementioned methods in recognition of objects in images. 

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS 
Execution: The method can be performed using a personal computer or can be imple- 
mented in specialized hardware. 

1. A representation under the form of an HMM is obtained for the subsequences that are 
looked for (word, protein profile, section of an image of the object). 




2. A tool will be obtained (and, where needed, trained, e.g., for speech recognition) for the
estimation of the posteriors. Examples are multi-Gaussians, neural networks, clusters, and
databases with Generalized Profiles and mutation matrices (PAM, BLOSUM, etc.).

3. One of the proposed algorithms should be implemented. They yield close performance 
but the method of Real Fitting coupled with a well checked dictionary should perform 
best. 

For the first algorithm (IVD):

(a) The classic Viterbi algorithm is implemented with the modification that, for
each pair P = (sample, state), one propagates the time-frames of the transitions
between the state $q_G$ and the states of the HMM $M$ for the path that arrives at P.
These are inherited from the path that wins the entrance in the pair P, except
at the moment when their decision is taken, namely when they receive the index
of the corresponding sample.

(b) $w = -\log P(M \mid X_{b_t}^{e_t})$ is computed by subtracting, from the cumulated posterior
that is returned by the Viterbi algorithm for the path $\bar{Q}_t$, the value
$(N - (e_t - b_t + 1)) \cdot \varepsilon_t$ corresponding to the contribution of the states
$q_G$, and dividing the result by $e_t - b_t + 1$. The term $e_t - b_t + 1$ from the previous
formula can be factored outside the fraction.

(c) The initialization of $\varepsilon$ is made with an expected mean value. One can use the $w$
that is computed when the state $q_G$ is associated with an emission posterior equal
to the average of the best K emission probabilities of the current sample, as done
in the well-known "garbage on-line model". In this case, K is trained using the
corresponding technique.

The next "Beam search" algorithms are implemented according to the description in
the corresponding sections. For each pair P = (sample, state) one computes, for each
corresponding path, the sum and length in the last phoneme, as well as the sum over
the normalized cumulated posteriors of the previous phonemes (and their number).
Also, the entrance and exit samples into the HMM $M$ are computed and propagated
as in the previous method, in order to ensure the localization of the subsequence.

4. If one searched entity (keyword, sequence, object) can have several HMM models, all 
of them are taken into consideration as competitors. This is the case of the words 
with several pronunciations (or of the objects that have different structures in different 
states, for the recognition in images). 

After the computation of the confidence measure for each model of the subsequences,
one eliminates those with a confidence measure in disagreement with a 'threshold' that
is trained for the configuration and the goal of the given application. For example, for
speech recognition with neural networks and minus of the logarithm of the posteriors,
the 'threshold' is chosen at the desired point of the ROC curve obtained in tests.

5. The remaining alternatives are extracted in the order of their confidence measure,
with the elimination of the conflicting alternatives, until exhaustion. Each time
an alternative is eliminated, the searched entity with the corresponding HMM is
re-estimated for the remaining sections in the sequence in which the search is performed.
If the new confidence measure passes the test of the 'threshold', then it will be inserted
in the position corresponding to its score in the queue of alternatives.

6. The successful alternatives can undergo tests of superior levels like for example a 
question of confirmation for speech recognition, opinion of one operator, etc. 

7. For object recognition in images:

Posteriors are obtained by computing a distance between the color of the model and
that of the element in the section of the image. If the context requires, the image will be
preprocessed to ensure a certain normalization (e.g., changeable conditions of light will
make necessary a transformation based on the histogram).

The phonemes of the speech recognition correspond to parts of the object. The
structure (existence of transitions and their probabilities) can be modified as a function
of the characteristics detected along the current path. For example, after detecting regions
of the object with certain lengths, one can estimate the expected length of the remaining
regions. Thus, the number of the expected samples for the future states can be
established and the HMM attached to the object will be configured accordingly.

A direction is scanned for the detection of the best fitting and afterwards, other direc- 
tions will be scanned for discovering new fittings, as well as for testing the previous 
ones. The final test will be certified by classical methods such as cross-correlation or 
by the analysis of the contours in the hypothesized position. 

To mention some examples for the application of the proposed method:

• The recognition of keywords is beginning to be used in automated answering systems
for banking, as well as in telephone and automated systems for control, sales or
information. The method offers a possibility to recognize keywords in spontaneous
speech with multiple speakers.

• The recognition of DNA sequences is important for the study of the human genome.
One of the biggest problems of the techniques involved is the large quantity of
data that has to be processed.

• The recognition of objects in images is used, among others, in cartography and in the 
coordination of industrial robots. The method allows a quick estimation of the position 
of the objects in scenes and can be validated with extra tests, using classical methods 
of cross-correlation. 
Speech Recognition and Signal Analysis by straight
Search of Subsequences with Maximal Confidence
Measure

1 Field of the invention 

The invention relates to a common component of:

• Speech Recognition 

• Keyword Spotting 

• Segments Alignment for DNA and proteins (Human Genome) 

• Recognition of Objects in Images

2 Background Art 

This invention addresses the problem of keyword spotting (KWS) in unconstrained speech
without explicit modeling of non-keyword segments (typically done by using filler HMM
models or an ergodic HMM composed of context dependent or independent phone models
without lexical constraints). Although several algorithms (sometimes referred to as "sliding
model methods") tackling this type of problem have already been proposed in the past, e.g.,
by using Dynamic Time Warping (DTW) [4] or Viterbi matching [9] allowing relaxation
of the (begin and endpoint) constraints, these are known to require the use of an
"appropriate" normalization of the matching scores, since segments of different lengths have
then to be compared. However, given this normalization and the relaxation of begin/endpoints,
straightforward Dynamic Programming (DP) is no longer optimal (or, in other words, the
DP optimality principle is no longer valid) and has to be adapted, involving more memory
and CPU. Indeed, at any possible ending time e, the match score of the best warp and start
time b of the reference has to be computed [4] (for all possible start times b associated with
unpruned paths). Moreover, in [9], and in the same spirit as what is presented here, for



sequence is used as scoring criterion. Finally, this adapted DP quickly becomes even more 
complex (or intractable) for more advanced scoring criteria (such as the confidence measures 
mentioned below). 

More recently, in work on confidence measures in the framework of hybrid
HMM/ANN systems, it was shown [1] that the use of accumulated local posterior
probabilities (as obtained at the output of a multilayer perceptron) normalized by the length of
the word segment (or, better, involving a double normalization over the number of phones
and the number of acoustic frames in each phone) yields good confidence measures
and good scores for the re-estimation of N-best hypotheses. Similar work, where this kind
of confidence measure was compared to several alternative approaches, was reported in [8]
and confirmed this conclusion. However, so far, the evaluation of such confidence measures
involved the estimation and rescoring of N-best hypotheses. Similar work and conclusions
(also using N-best rescoring) were also reported in [7], using likelihood ratio rescoring and
non-keyword rejection.

2.1 KWS without filler models 

Let X = {x_1, x_2, ..., x_n, ..., x_N} denote the sequence of acoustic vectors in which we want
to detect a keyword, and let M be the HMM model of a keyword, consisting of
L states \mathcal{Q} = {q_1, q_2, ..., q_l, ..., q_L}. Assuming that M is matched to a subsequence
X_b^e = {x_b, ..., x_e} (1 ≤ b ≤ e ≤ N) of X, and that we have an implicit (not modeled) garbage/filler
state q_G preceding and following M, we define (approximate) the log posterior of a model
M given a subsequence X_b^e as the average posterior probability along the optimal path, i.e.:

-\log P(M|X_b^e) \simeq \frac{1}{e-b+1} \min_{Q \in M} -\log P(Q|X_b^e)

  = \frac{1}{e-b+1} \min_{Q \in M} \{ -\log P(q^b|q_G) - \sum_{n=b}^{e-1} [\log P(q^n|x_n) + \log P(q^{n+1}|q^n)]

      - \log P(q^e|x_e) - \log P(q_G|q^e) \}    (1)

where Q = {q^b, q^{b+1}, ..., q^e} represents one of the possible paths of length (e-b+1) in M, and
q^n the HMM state visited at time n along Q, with q^n ∈ \mathcal{Q}. In this expression, q_G represents
the "garbage" (filler) state which is simply used here as the non-emitting initial and final
state of M. Transition probabilities P(q^b|q_G) and P(q_G|q^e) can be interpreted as the keyword



entrance and exit penalties, as optimized in [3], but these have not been optimized here. In
our case, local posteriors P(q_k|x_n) were estimated as output values of a multilayer perceptron
(MLP) used in a hybrid HMM/ANN system [2].

For a specific sub-sequence X_b^e, expression (1) can easily be estimated by dynamic
programming since the sub-sequence and the associated normalizing factor (e-b+1) are given.
However, in the case of keyword spotting, this expression should be estimated for all possible
begin/endpoint pairs {b, e} (as well as for all possible word models), and we define the
matching score of X on M as:

S(M|X) = -\log P(M|X_{b^*}^{e^*})    (2)

where the optimal begin/endpoints {b*, e*}, and the associated optimal path Q*, are the
ones yielding the lowest average negative log local posterior:

\langle Q^*, b^*, e^* \rangle = \arg\min_{\{Q,b,e\}} -\frac{1}{e-b+1} \log P(Q|X_b^e)    (3)

Of course, in the case of several keywords, all possible models will have to be evaluated. 
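
As a reference point for expressions (2) and (3), the exhaustive search over all begin/endpoint pairs can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the method of the invention: the keyword model is assumed to be a strict left-to-right HMM without skips, all transition log-probabilities are taken as zero, and the names `neglog`, `segment_cost` and `best_match` are hypothetical. `neglog[n][s]` stands for -log P(q_s|x_n), with 0-based indices.

```python
import math


def segment_cost(neglog, b, e):
    """Minimal summed -log posterior for aligning frames b..e (inclusive)
    to states 0..L-1, left-to-right, each state visited at least once.
    Assumption: transition log-probabilities are zero."""
    L = len(neglog[0])
    prev = [math.inf] * L
    prev[0] = neglog[b][0]          # the path must start in the first state
    for n in range(b + 1, e + 1):
        cur = [math.inf] * L
        for s in range(L):
            # stay in state s, or advance from state s-1
            best = prev[s] if s == 0 else min(prev[s], prev[s - 1])
            cur[s] = best + neglog[n][s]
        prev = cur
    return prev[L - 1]              # the path must end in the last state


def best_match(neglog):
    """Exhaustive evaluation of expression (3): returns (score, b, e)."""
    N, L = len(neglog), len(neglog[0])
    best = (math.inf, None, None)
    for b in range(N):
        for e in range(b + L - 1, N):   # at least one frame per state
            score = segment_cost(neglog, b, e) / (e - b + 1)
            if score < best[0]:
                best = (score, b, e)
    return best
```

Each (b, e) pair needs its own DP pass here, which is the cost that the adapted or iterated approaches try to avoid.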

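As shown in [1, 8], a double averaging involving the number of frames per phone and the
number of phones will usually yield slightly better performance:

\langle Q^*, b^*, e^* \rangle = \arg\min_{\{Q,b,e\}} \frac{1}{J} \sum_{j=1}^{J} \frac{1}{e_j-b_j+1} \sum_{n=b_j}^{e_j} -\log P(q_j^n|x_n)    (4)

where J represents the number of phones in the hypothesized keyword model, q_j^n the
hypothesized phone q_j for input frame x_n, and b_j and e_j the first and last frames matched
to phone q_j.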

However, given the time normalization and the relaxation of begin/endpoints, straight- 
forward DP is no longer optimal and has to be adapted, usually involving more memory and 
CPU. A new (and simple) solution to this problem is proposed in Section 3.1. 

2.2 Filler-based KWS

Although various solutions have been proposed towards the direct optimization of (2) as,
e.g., in [4, 9], most of the keyword spotting approaches today prefer to preserve the
optimality and simplicity of Viterbi DP by modeling the complete input [5] and explicitly [6] or
implicitly [3] modeling non-keyword segments by using so called filler or garbage models as



by extraneous garbage models/states q_G (and grammatical constraints ruling the possible
keyword/non-keyword sequences). 

Let us consider only the case of detecting one keyword per utterance at a time. In this
case, the keyword spotting problem amounts to matching the whole sequence X of length
N onto an extended HMM model \overline{M} consisting of the states {q_G, q_1, ..., q_L, q_G}, in which
a path (of length N) is denoted \overline{Q} = {q_G, ..., q_G, q^b, q^{b+1}, ..., q^e, q_G, ..., q_G}, with (b-1) garbage
states q_G preceding q^b and (N-e) states q_G following q^e, respectively emitting the vector
sequences X_1^{b-1} and X_{e+1}^N associated with the non-keyword segments.

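Given some estimation of P(q_G|x_n) (e.g., using probability density functions trained on
non-keyword utterances), the optimal path \overline{Q}^* (and, consequently, b* and e*) is then given
by:

\overline{Q}^* = \arg\min_{\forall \overline{Q} \in \overline{M}} -\log P(\overline{Q}|X)

  = \arg\min_{\forall \overline{Q} \in \overline{M}} \{ -\log P(Q|X_b^e) - \sum_{n=1}^{b-1} \log P(q_G|x_n) - \sum_{n=e+1}^{N} \log P(q_G|x_n) \}    (5)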

which can be solved by straightforward DP (since all paths have the same length). The main
problem of filler-based keyword spotting approaches is then to find ways to best estimate
P(q_G|x_n) in order to minimize the error introduced by the approximations. In [3], this value
was defined as the average of the N best local scores while, in other approaches, this value
is generated from explicit filler HMMs. However, these approaches will usually not lead to
the "optimal" solution given by (2).

3 Disclosure of Invention 

3.1 Iterating Viterbi Decoding (IVD)

In the following, we show that it is possible to define an iterative process, referred to as
Iterating Viterbi Decoding (IVD), with good/fast convergence properties, estimating the value
of P(q_G|x_n) such that straightforward DP (5) yields exactly the same segmentation (and
recognition results) as (3). While the same result could be achieved through a modified
DP in which all possible combinations (all possible begin/endpoints) would be taken into
account, it is possible to show that the algorithm proposed below is more efficient (in terms
of both CPU and memory requirements).

Here, I will use a similar scoring technique for keyword spotting without explicit filler 
model. Compared to previously devised "sliding model" methods (such as [4, 9]), the first 
algorithm proposed here is based on: 

1. A matching score defined as the average observation posterior along the most likely 
state sequence. It is indeed believed that local posteriors (or likelihood ratios, as in [7]) 
are more appropriate to the task. 

2. The iteration of a Viterbi decoding algorithm, which does not require scoring for all
begin/endpoints or N-best rescoring, and which can be proved to (quickly) converge to
the "optimal" (from the point of view of the chosen scoring functions) solution without
requiring any specific filler models, using straightforward Viterbi alignments (similar
to regular filler-based KWS, but at the cost of a few iterations).

3.2 IVD: Description

The IVD algorithm is based on the same criterion as the filler-based approaches (5), but
rather than looking for explicit (and empirical) estimates of P(q_G|x_n) we aim at
mathematically estimating its value (which will be different and adapted to each utterance) such that
solving (5) is equivalent to solving (3). Thus, we perform an iterative estimation of P(q_G|x_n),
such that the segmentation resulting from (5) is the same as what would be obtained from (3).
Defining ε = -log P(q_G|x_n), the proposed algorithm can be summarized as follows:

1. Start from an initial value ε_0 (e.g., a cheap estimation of the score of a "match").
It is actually proven that the iterative process presented here will always converge to
the same solution (in more or fewer cycles, with a worst-case upper bound of N
iterations) independently of this initialization. In the experiments
reported below, ε was initialized to -log of the maximum of the local probabilities
P(q_k|x_n) for each frame x_n.

An alternative choice could be to initialize ε_0 to a pre-defined score that expression (1)
should reach to declare a keyword "matching" (see point 4 below). In this last case, if
ε increases at the first iteration, then we can (as proven) directly infer that the match
will be rejected; otherwise it will be accepted.



2. Given the current estimate ε_t of -log P(q_G|x_n) at iteration t, find the optimal path
⟨Q_t, b_t, e_t⟩ according to (5) and matching the complete input.

3. Update the estimate to ε_{t+1}, defined as the average of the negative log local posteriors
along the optimal path Q_t (matching the subsequence X_{b_t}^{e_t} resulting from (5) on the
keyword model), i.e.:

\varepsilon_{t+1} = -\frac{1}{e_t-b_t+1} \log P(Q_t|X_{b_t}^{e_t})    (6)

4. Return to step 2 and iterate until convergence. If we are not interested in the optimal
segmentation, this process could also be stopped as soon as ε reaches a (pre-defined)
minimum threshold below which we can declare that a keyword has been detected.

Correctness and convergence proofs of this process, and its generalization to other criteria,
are available: each IVD iteration (from the second iteration) will decrease the value of ε_t, and
the final path yields the same solution as (3).
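
The four steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the implementation of the invention: the keyword is assumed to be a strict left-to-right HMM without skips, transition log-probabilities are taken as zero, the input is assumed to have at least as many frames as the model has states, and the names `neglog`, `viterbi_with_filler` and `ivd` are hypothetical. `neglog[n][s]` stands for -log P(q_s|x_n) and `eps` for the garbage cost ε (0-based frame indices).

```python
import math


def _total(entry):
    return entry[0]


def viterbi_with_filler(neglog, eps):
    """One Viterbi pass over garbage + keyword + garbage (criterion (5)),
    with a constant garbage cost eps per frame.
    Returns (b, e, keyword_cost) of the best full-length path."""
    N, L = len(neglog), len(neglog[0])
    INF = math.inf
    kw = [(INF, -1, INF)] * L      # per state: (total, begin frame, keyword-only cost)
    post = (INF, -1, -1, INF)      # trailing garbage: (total, begin, end, keyword cost)
    for n in range(N):
        # Trailing garbage at frame n: stay there, or leave the keyword at frame n-1.
        stay = (post[0] + eps, post[1], post[2], post[3])
        leave = (kw[L - 1][0] + eps, kw[L - 1][1], n - 1, kw[L - 1][2])
        post = min(stay, leave, key=_total)
        new_kw = []
        for s in range(L):
            c = neglog[n][s]
            cands = [(kw[s][0] + c, kw[s][1], kw[s][2] + c)]          # self-loop
            if s > 0:                                                 # advance
                cands.append((kw[s - 1][0] + c, kw[s - 1][1], kw[s - 1][2] + c))
            else:                                                     # enter keyword at n
                cands.append((n * eps + c, n, c))
            new_kw.append(min(cands, key=_total))
        kw = new_kw
    ending_here = (kw[L - 1][0], kw[L - 1][1], N - 1, kw[L - 1][2])
    final = min(post, ending_here, key=_total)
    return final[1], final[2], final[3]


def ivd(neglog, eps0):
    """Iterate steps 2-4: re-estimate eps as the average keyword cost, update (6)."""
    eps, prev_seg = eps0, None
    for _ in range(len(neglog) + 1):        # worst case stated in the text: N iterations
        b, e, kw_cost = viterbi_with_filler(neglog, eps)
        eps = kw_cost / (e - b + 1)
        if (b, e) == prev_seg:              # same segmentation twice: converged
            break
        prev_seg = (b, e)
    return eps, b, e
```

Calling `ivd(neglog, eps0)` returns the converged score together with the begin and end frames; only one score and a pair of indices are carried per DP entry.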

3.3 One-pass keyword spotting 

3.3.1 General Description 

The above algorithm has a very good experimental convergence speed (3-5 iterations in our 
tests). However, the worst case theoretical convergence speed of the process is N. For this 
reason, a one step computation is potentially interesting. In the next subsection we show 
that the standard DP cannot be used for solving the equation (3). 

3.3.2 The Principle of Optimality 

Let us define T(M, X) as the DP table of emission probabilities for an utterance X and
the states of the hypothesized word W. When solving by standard DP, we would compute
for each entry of the table T(M, X) at frame k of X and state s of M three values: S_ks,
L_ks and C_ks, where S_ks corresponds to the sum of the posteriors on the optimal path that
leads to the entry, L_ks holds the length of the optimal path computed so far, and C_ks is the
estimation of the cost on the optimal expanded path.

By a path leading to an entry T(k, s) we mean a sequence of entries in the table T, such
that there is exactly one entry for each time frame t < k. At each entry T(k, s), DP selects a
locally optimal path denoted P_ks.



At each step k, we consider all pairs of entries of table T(M, X) of type T(k, s), T(k-1, t).
We update for each such pair the current cost C_ks (initially ∞), by comparing it with the
alternative given by:

S_{ks} = S_{(k-1)t} - \log p(s|x_k) p(s|t)
L_{ks} = L_{(k-1)t} + 1,  \forall t > 0, t \le L
C_{ks} = S_{ks} / L_{ks}    (7)

wanting to have at step k the path P_ks, from the paths P_(k-1)t, that minimizes C_NL. With
DP, one will choose the P_ks with minimal C_ks.

In order for the previous computation to be correct, the optimality principle needs to
be respected. The optimality principle of Dynamic Programming requires that the path to
the frame k - 1 that minimizes C_NL also minimizes C_ks for an entry at frame k of table
T(M, X). We have proved that expression (7) does not respect the optimality principle
of Dynamic Programming.

3.3.3 Pruning with beam search 

Dynamic Programming can be viewed as a set of safe prunings that are applied at each
entry of the DP table, with the property that only one alternative is maintained. We have
thus shown that Dynamic Programming cannot be used, since the principle of optimality is
not respected. We therefore try to detect the type of safe pruning that can be done.

We have proved that if at a frame a we have two paths and P£ with S£ < S' a and 
L a < L", then at no frame c>a will a path P* be forsaken for a path P£ if P„Ci* F^CP" 
and Py&sP^p;. We will note the order relation as i^^. We have further shown that 
a path P' may be discarded only for a lower cost one, P" . 

P' \prec P'' \Rightarrow C'_k < C''_k    (8)

Thus, algorithm 1 computes S(M,X) and Q* from equation (3). 

By ordering the set of paths, according to Equation 8, we only need to check the line 1.2 
of algorithm 1 up to the eventual insertion place. The last paths are candidates for pruning 
in line 1.1. In order for the pruning to be acceptable, we will prune only paths that were 
too long on the last state. An additional counter is needed for storing the state length. This 
counter is reset when the state is changed and is incremented at each advance with a frame. 



procedure OneStep(W, X)
    SetOfPaths(1..N, 1..K) <- ∅
    for all frame=1; frame <= N; frame++ do
        for all state=1; state <= K; state++ do
            for all candidate p_i ∈ SetOfPaths(frame-1, 1..K) do
                Add(p_i, SetOfPaths[frame, state])
            end
        end
    end
    SetOfPaths[frame, K] <- best of the candidates
end.

procedure Add(path, set-of-paths)
    for all p_i ∈ set-of-paths do
1.1     if path ≺ p_i then
            delete p_i
        end
1.2     if p_i ≺ path then
            return
        end
    end
    Insert path in set-of-paths
end.

Algorithm 1: One Step Algorithm
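
The Add procedure of Algorithm 1 can be illustrated with a small sketch. The dominance test used here is an assumption standing in for the order relation of (8), not a rule taken from the text: a path summarized as (sum, length) is assumed to dominate another one when its summed cost is not higher and its length is not shorter, which keeps its average cost at least as good under any common extension with non-negative costs. The names `dominates` and `add` are hypothetical.

```python
def dominates(a, b):
    """Assumed dominance test on (sum, length) pairs: a is never worse than b
    after both are extended by the same frames with non-negative costs."""
    return a[0] <= b[0] and a[1] >= b[1]


def add(path, paths):
    """Insert `path` into the list `paths` of mutually non-dominated candidates.
    Mirrors lines 1.1 and 1.2 of Algorithm 1 and returns the updated list."""
    kept = []
    for p in paths:
        if dominates(p, path):        # line 1.2: a stored path already wins
            return paths
        if not dominates(path, p):    # line 1.1: drop paths beaten by the newcomer
            kept.append(p)
    kept.append(path)
    return kept
```

For example, adding (3.0, 3) to a set containing (4.0, 2) replaces the stored path, while (2.0, 1) would be kept alongside (3.0, 3) because neither dominates the other.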



3.4 One pass confidence-based keyword spotting 
3.4.1 The Method of Double Normalization 

The corresponding confidence measure is defined as:

\frac{1}{NVP} \sum_{p_i \in VP} \frac{\sum_{ps_t \in p_i} -\log(ps_t)}{length(p_i)}    (9)

where NVP stands for the number of visited phonemes and VP stands for the set of visited
phonemes. An average is computed over all posteriors ps_t of the emission probabilities for the
time frames matched to the visited phoneme p_i. The function length(p_i) gives the number
of time frames matched against p_i.
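
As a small worked example of expression (9), the sketch below computes the double normalization for an alignment that is already known, next to the simple normalization for comparison. The input layout (one list of -log posteriors per visited phoneme) and the function names are assumptions made for the illustration.

```python
def double_normalization(neglog_per_phoneme):
    """Expression (9): average over visited phonemes of the per-phoneme
    average -log posterior. Each inner list holds the -log posteriors of
    the frames matched to one visited phoneme."""
    nvp = len(neglog_per_phoneme)
    return sum(sum(frames) / len(frames) for frames in neglog_per_phoneme) / nvp


def simple_normalization(neglog_per_phoneme):
    """Accumulated -log posterior divided by the total number of frames."""
    total = sum(sum(frames) for frames in neglog_per_phoneme)
    length = sum(len(frames) for frames in neglog_per_phoneme)
    return total / length
```

With two phonemes matched to frames with costs [1.0, 3.0] and [5.0], the double normalization gives (2.0 + 5.0) / 2 = 3.5, whereas the simple normalization gives 9.0 / 3 = 3.0: a short, badly matched phoneme weighs more under the double normalization.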

This method consists of a breadth-first Beam Search algorithm. It refers to a set of
reduction rules and certain normalizations:

For the state q_G, in this method, the logarithm of the emission posterior is equal to
zero. For each frame e and for each state s, the set of paths/probabilities of having the frame
e in the state s is computed as the first N maxima (N can be finite) of the confidence measure
for all paths in HMM M of length e and ending in the state s. The paths that, according
to the reduction rules, will lose the final race when compared with another already known
path will be deleted as well.

We denote by a_1, p_1, l_1 and a_2, p_2, l_2 the confidence measure for the previous phonemes, the
posterior in the current phoneme and the length in the current phoneme for the path Q_1,
respectively the path Q_2. The rules that may be used for the reduction of the search space
by discarding a path Q_1 for a path Q_2 are in this case any of the following:

1. l_2 ≥ l_1, A > 0, B < 0 and L_c^2 A + L_c B + C > 0

2. l_2 ≥ l_1, A ≥ 0, B ≥ 0 and C > 0

3. l_2 ≥ l_1, A ≤ 0, C > 0 and L^2 A + L B + C > 0

4. l_2 ≥ l_1, A = 0, B < 0 and L B + C > 0

where A = a_1 - a_2, B = (a_1 - a_2)(l_1 + l_2) + p_1 - p_2, C = (a_1 - a_2) l_1 l_2 + p_1 l_2 - p_2 l_1,
L = L_max - max{l_1, l_2}, L_c = -B/2A > 0 and L_max is the maximum acceptable length for a
phoneme.

By discarding paths only if one of the above rules is satisfied, the optimum defined by
the confidence measure with double normalization can be guaranteed, if no phone may be
avoided by the HMM M. Any HMM may be decomposed into HMMs with this quality. The
4th rule is included in the 3rd and its test is useless if the latter was already checked.

The first test, l_2 ≥ l_1, tells us if Q_2 has chances to eliminate Q_1; otherwise we will check if
Q_1 eliminates Q_2. These tests were inferred from the conditions of maintaining the final
maximal confidence measure while reduction takes place. In order to use the method of
double normalization without decomposing HMMs that skip some phonemes, the previous
rules are modified taking into account the number of visited phonemes for any path F_1
respectively F_2 and the number of phonemes that may follow the current state.

A simplified test may be: 



This test is weaker than the 2nd reduction rule. For example, a path is eliminated by
a second path if the first one has an inferior confidence measure (higher in value) for
the previous phonemes, a shorter length, and the minus of the logarithm of the cumulated
posterior in the current phoneme also inferior (higher in value) to that of the second one.

An additional confidence measure based on the maximal length, L_max, and on the
maximum of the minus of the logarithm of the cumulated and normalized posterior in a phoneme,
P_max, can be used in order to limit the number of stored paths. Paths are discarded when:

• p > L_max P_max in any state

• p/l > P_max at the output from a phoneme

where p and l are the values in the current phoneme for the minus of the logarithm of the
cumulated posterior and for the length of the path that is discarded. These tests allow for
the elimination of the paths that are too long without being outstanding, respectively of
the paths with phonemes having unacceptable scores, otherwise compensated by very good
scores in other phonemes.

If N is chosen equal to one, the aforementioned rules are no longer needed, and we always
propagate the path with the maximal current estimation of the confidence measure. The
obtained results are very good, even if the defined optimum is guaranteed for this method
only when N is bigger than the length of the sequence allowed by or of the tested 
sequence. 

The same approach is valid for the simple normalization, where the HMM for the searched
word will be grouped into a single phoneme.

3.4.2 The Method of Real Fitting

We have also defined a new confidence measure that represents differently the exigencies
of the recognition. Since the phonemes and the absent states can be modeled by the used
HMMs, we find it interesting to request the fitting of each phoneme in the model with a
section of the sequence. Therefore, we measure the confidence level of a subsequence as being
equal to the maximum over all phonemes of the minus of the logarithm of the cumulated
posterior of the phone, normalized with its length.

\max_{phoneme \in \text{Visited Phonemes}} \frac{\sum_{phoneme} -\log(\text{posteriors})}{\text{phoneme length}}
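
For a known alignment, the Real Fitting measure reduces to a maximum of per-phoneme averages. The short sketch below illustrates this; the input layout (one list of -log posteriors per visited phoneme) and the name `real_fitting` are assumptions made for the example.

```python
def real_fitting(neglog_per_phoneme):
    """Worst (largest) per-phoneme average of -log posteriors.
    Each inner list holds the -log posteriors of the frames matched
    to one visited phoneme."""
    return max(sum(frames) / len(frames) for frames in neglog_per_phoneme)
```

With phonemes matched to frames with costs [1.0, 3.0] and [5.0], the measure is 5.0: a single badly fitting phoneme determines the score, however well the others match.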



The rule that may be used in this framework for the reduction of the number of visited 
paths is: 

• Q_2 is discarded in favor of another path Q_1 if the confidence measure of the Real
Fitting for the previous phonemes is inferior (higher in value) for Q_2 compared with
Q_1, and if p_1 < p_2 and l_2 < l_1.

where p_1, l_1, p_2, l_2 represent the minus of the logarithm of the cumulated posterior,
respectively the number of frames, in the current phoneme for the path Q_1, respectively Q_2.

Similarly to the previous method, the set of visited paths can be pruned by discarding
those with:

• p > L_max P_max in any state

• p/l > P_max at the output from a phoneme

where p and l are the values in the current phoneme for the minus of the logarithm of the
cumulated posterior and for the length of the path that is discarded. We recall that the
constants are the maximal length L_max, respectively the accepted maximum
of the minus of the logarithm of the cumulated and normalized posterior in a phoneme, P_max.

3.5 Conclusions 

We have thus proposed a new method for keyword spotting, based on recent advances in
confidence measures, using local posterior probabilities, but without requiring the explicit
use of filler models.

We have proposed a new algorithm, referred to as Iterating Viterbi Decoding (IVD), to solve
the above optimization problem with a simple DP process (not requiring the storage of pointers
and scores for all possible ending and start times), at the cost of a few iterations. Three other
beam-search algorithms corresponding to three different confidence measures were also described.

While the proposed approach allows for an easy generalization to more complex criteria,
preliminary results obtained on the basis of 100 keywords (and without any specific tuning)
appear to be particularly competitive with alternative approaches.

3.6 The object of the invention consists of: 

• Method of recognition of a subsequence using a direct maximization of confidence 



• The method of IVD for directly maximizing the confidence measures based on simple 
normalization. 

• The use of the confidence measure and method of recognition named 'Real Fitting',
based on individual fitting for each phoneme.

• Methods of recognition using simple and double normalization by: 

• combining these measures with additional confidence measures mentioned here, respec- 
tively the maximal length and real matching limitation. 

• The use of the aforementioned methods in keyword recognition. 

• The use of the aforementioned methods in subsequence recognition of organic matter. 

• The use of the aforementioned methods in recognition of objects in images. 

4 Best Mode for Carrying Out the Invention 

Execution: It is necessary to use a computer, but the method can also be implemented in 
hardware. 

1. A representation under the form of an HMM is obtained for the subsequences that are 
looked for (word, protein profile, section of an image of the object). 

2. A tool will be obtained (possibly trained, e.g., for speech recognition) for the
estimation of the posteriors. For example multi-Gaussians, neural networks, clusters,
database with Generalized Profiles and mutation matrices (PAM, BLOSUM, etc.).

3. One of the proposed algorithms should be implemented. They yield close performance 
but the method of Real Fitting coupled with a well checked dictionary should perform 
best. 

For the first algorithm (IVD),

(a) The classic Viterbi algorithm is implemented with the modification that, for
each pair P = (sample, state), one propagates the moments of transition between
the state q_G and the states of the HMM M for the path that arrives at P. These
are inherited from the path that wins the entrance in the pair P, except at
the moment when their decision is taken, namely when they receive the index of
the corresponding sample.

(b) w = -log P(M|X_b^e) is computed by subtracting from the cumulated posterior
that is returned by the Viterbi algorithm for the optimal path the value
(N - (e - b + 1)) * ε, corresponding to the contribution of the states q_G, and dividing
the result by e - b + 1. The term e - b + 1 from the previous formula can be factored
outside the fraction.

(c) The initialization of ε is made with an expected mean value. One can use the w
that is computed when the state q_G is associated with an emission posterior equal
to the average of the best K emission probabilities of the current sample, as done
in the well-known "garbage on-line model". In this case, K is trained using the
corresponding technique.

The next 'Beam search' algorithms are implemented according to the description in
the corresponding sections. For each pair P = (sample, state) one computes for each
corresponding path the sum and length in the last phoneme, as well as the sum over
the normalized cumulated posteriors of the previous phonemes (and their number).
Also, the entrance and exit samples into the HMM M are computed and propagated
like in the previous method, in order to ensure the localization of the subsequence.

4. If one searched entity (keyword, sequence, object) can have several HMM models, all
of them are taken into consideration as competitors. This is the case of the words
with several pronunciations (or of the objects that have different structures in different
states, for the recognition in images).

After the computation of the confidence measure for each model of the subsequences,
one eliminates those whose confidence measure fails a 'threshold' that
is trained for the configuration and the goal of the given application. For example, for
speech recognition with neural networks and the minus of the logarithm of the posteriors,
the 'threshold' is chosen at the desired point of the ROC curve obtained in tests.

5. The remaining alternatives are extracted in the order of their confidence measure,
eliminating the conflicting alternatives, until exhaustion. Each time an alternative
is eliminated, the searched entity with the corresponding HMM is re-estimated
for the remaining sections of the sequence in which the search is performed.
If the new confidence measure passes the 'threshold' test, it is inserted
in the queue of alternatives at the position corresponding to its score.

6. The successful alternatives can undergo higher-level tests, for example a
confirmation question for speech recognition, the opinion of an operator, etc.

7. For objects recognition in images: 

Posteriors are obtained by computing a distance between the color of the model and
that of the element in the section of the image. If the context requires it, the image is
preprocessed to ensure a certain normalization (e.g., changing light conditions
make a histogram-based transformation necessary).

The phonemes of speech recognition correspond to parts of the object. The structure
(existence of transitions and their probabilities) can be modified as a function of the
characteristics detected along the current path. For example, after detecting regions
of the object with certain lengths, one can estimate the expected length of the remaining
regions. Thus, the number of expected samples for the future states can be
established and the HMM attached to the object is configured accordingly.

A direction is scanned for the detection of the best fitting and afterwards, other direc- 
tions will be scanned for discovering new fittings, as well as for testing the previous 
ones. The final test will be certified by classical methods such as cross-correlation or 
by the analysis of the contours in the hypothesized position. 

5 Industrial Applicability 

Here we present some examples for the application of the proposed method in industry:

• The recognition of keywords is beginning to be used in automated answering systems
for banking, as well as in telephone and automated systems for control, sales or
information. The method offers a possibility to recognize keywords in spontaneous
speech with multiple speakers.

• The recognition of DNA sequences is important for the study of the human genome.
One of the biggest problems of the techniques involved is the large quantity of
data that has to be processed.

• The recognition of objects in images is used, among others, in cartography and in the
coordination of industrial robots. The method allows a quick estimation of the position
of the objects in scenes and can be validated with extra tests, using classical methods
of cross-correlation.
Speech Recognition and Signal Analysis by straight
Search of Subsequences with Maximal Confidence
Measure

Independent Claim 1. 

Preamble: 

Recognizes subsequences, represented as Hidden Markov Models (HMM), that are searched 
for in a given sequence. 

We refer to the confidence measures, that are used for the reclassification of the winning 
hypotheses in Speech Recognition. These are some examples of such measures: 

simple normalization = accumulated posterior, normalized with the length of the
subsequence

double normalization = double normalization of the accumulated posterior over the
number of phonemes and over the number of acoustic samples in each phoneme.



characterized by: It allows the additional confidence measure, based on the extremes of
the values of the logarithm of the accumulated posterior in each phoneme, normalized with
its length. We call this measure 'real fitting'.

\max_{phoneme \in \text{Visited Phonemes}} \frac{\sum_{phoneme} -\log(\text{posteriors})}{\text{phoneme length}}

characterized by: It searches the subsequences that offer the maximization of one of the
mentioned confidence measures, over all possible matchings.

characterized by: It allows the revaluation of the alternatives that offer the highest among 
any mentioned confidence measure on the basis of another confidence measure. 

characterized by: It computes the alternative that maximizes the 'simple normalization'
by using the method that we have called 'Iterative Viterbi Decoding' and that estimates
the emission probability of the filler states in an iterative manner, as being equal to the
confidence measure in the previous iteration.

characterized by: It computes the alternative that maximizes the 'simple normalization',
'double normalization' or 'real fitting' using an algorithm that considers the emission
probability of the filler state as zero. This method computes progressively, for each pair of sample
and state of HMM, a set of possible alternative paths to reach it. The computation of this
set is based on the sets of paths that lead to the states that can be associated to the previous
sample.

This set can be reduced by applying the rules appropriate to the given confidence 
measure, which preserve the correctness of the inference. 

This set can also be reduced by using heuristics that are based on the aforementioned 
rules, speeding up the computation at the risk of reducing the theoretical quality of 
the recognition. 
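
The idea of keeping, for each pair of sample and state, a reduced set of alternative paths can be sketched for the 'simple normalization' case. The sketch rests on several assumptions that are not taken from the claims: costs are negative log posteriors (hence non-negative), the keyword is a strict left-to-right HMM, frames outside the keyword are simply ignored, and the reduction rule is a plain dominance test in which a path with lower or equal cost and greater or equal length makes another path at the same sample and state redundant. The function names are hypothetical.

```python
def prune(alternatives):
    """Keep only non-dominated (cost, length) pairs.

    Assumed rule: (c1, l1) dominates (c2, l2) when c1 <= c2 and l1 >= l2,
    since any common continuation then yields a lower or equal average cost.
    """
    kept = []
    # Sort by cost ascending, then by length descending.
    for c, l in sorted(set(alternatives), key=lambda a: (a[0], -a[1])):
        if all(l > kept_len for _, kept_len in kept):
            kept.append((c, l))
    return kept

def best_simple_normalization(cost):
    """Smallest average cost over all keyword matchings in the utterance.

    cost[t][q]: -log posterior of keyword state q at frame t (assumption).
    """
    T, Q = len(cost), len(cost[0])
    prev = [[] for _ in range(Q)]
    best = float("inf")
    for t in range(T):
        cur = []
        for q in range(Q):
            c = cost[t][q]
            alts = [(pc + c, pl + 1) for pc, pl in prev[q]]       # self-loop
            if q > 0:
                alts += [(pc + c, pl + 1) for pc, pl in prev[q - 1]]  # advance
            else:
                alts.append((c, 1))            # the keyword may start at frame t
            cur.append(prune(alts))
        for c, l in cur[Q - 1]:                # the keyword may end at frame t
            best = min(best, c / l)
        prev = cur
    return best
```

A heuristic variant would cap the size of each set (for instance by keeping only the few alternatives with the lowest average cost), which is faster but no longer guaranteed to find the best matching.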



Speech Recognition and Signal Analysis by straight 
Search of Subsequences with Maximal Confidence 

Measure. 

Dependent Claim 2. 

Preamble: 
It is based on the Claim 1. 

It estimates the existence of keywords and their position in utterances. 



characterized by: It uses the methods described in Claim 1, for recognition of subse- 
quences represented by Hidden Markov Models. 




Dependent Claim 3. 

Preamble: 
It is based on the Claim 1. 

It estimates the existence of biomolecular subsequences and their position in the chains 
of DNA using models like generalized profiles. 



characterized by: The estimation of their existence and position is made according to the 
methods described in the Claim 1, for recognition of subsequences represented by Hidden 
Markov Models. 




Dependent Claim 4. 

Preamble: 

It is based on the Claim 1. It carries out the estimation of the existence of objects and their 
position in images. 



characterized by: It uses the methods described in Claim 1, for the recognition of subse- 
quences represented by Hidden Markov Models (HMM). 

characterized by: Sections through views of virtual objects are modeled by sets of Hidden 
Markov Models. 

characterized by: It uses a probabilistic model based on a distance computed between 
colors. 

characterized by: The Hidden Markov Models that model the objects can be structured 
as distinct regions, which play, within the method, the role of the phonemes. 

characterized by: The models of the objects can be modified in a dynamic manner 
with respect to the transition properties (existence and probability) on the basis of the 
accumulated information during the fitting process. 




Abstract 

The invention belongs to the technical domain of decoding, classification, alignment and 
matching of data. 

The invention refers to new methods of keyword spotting in utterances, detection of 
subsequences in chains of organic matter (DNA) and recognition of objects in images. The 
proposed methods search in an optimized way for the matching that maximizes, over all 
possible matchings, certain confidence measures based on normalized posteriors. Three such 
confidence measures are used: two are inspired by prior work in Speech Recognition, 
and the third is new. 

Application fields for this invention are: man-machine interfaces (using speech 
recognition, e.g. control systems, banking, flight services, etc.), coordination systems (for 
industrial robots and automata) and development systems for pharmaceutical products. 